Girsanov Based Direct Policy Gradient Methods
Authors
Abstract
Despite the plethora of reinforcement learning algorithms in machine learning and control, the majority of work in this area relies on discrete-time formulations of stochastic dynamics. In this work we present a new policy gradient algorithm for reinforcement learning in continuous state-action spaces and continuous time. The derivation is based on successive application of Girsanov's theorem and the use of the Radon-Nikodym derivative as formulated for Markov diffusion processes. The resulting policy gradient is reward-weighted, with the reward taking the form of a path integral. We apply the resulting algorithm to two simple examples of learning attractor landscapes for rhythmic and discrete movements.
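The estimator below is a minimal, discrete-time sketch of this idea, not the paper's implementation: the controlled diffusion is discretised with Euler-Maruyama, so the Radon-Nikodym derivative between the controlled and reference path measures reduces to a product of Gaussian increment densities, and the policy gradient becomes the path reward weighted by the score of that product. The toy dynamics, the linear policy, and all parameter choices are illustrative assumptions.

import numpy as np

def policy(theta, x):
    # Illustrative linear feedback policy: u = theta * x.
    return theta * x

def rollout(theta, x0=1.0, dt=0.01, horizon=1.0, sigma=0.5, rng=None):
    # Simulate dx = u dt + sigma dW by Euler-Maruyama and accumulate
    # the path reward and the score d/dtheta log p_theta(path).
    rng = rng if rng is not None else np.random.default_rng()
    x, reward, score = x0, 0.0, 0.0
    for _ in range(int(horizon / dt)):
        u = policy(theta, x)
        dx = u * dt + sigma * rng.normal(0.0, np.sqrt(dt))
        # Each increment is Gaussian, dx ~ N(u dt, sigma^2 dt), so
        # d/dtheta log N(dx; u dt, sigma^2 dt) = (dx - u dt) * x / sigma^2.
        score += (dx - u * dt) * x / sigma**2
        x += dx
        reward += -x**2 * dt   # quadratic state cost written as negative reward
    return reward, score

def policy_gradient(theta, n_paths=500):
    # Reward-weighted estimator: grad J(theta) ~ E[ R(path) * score(path) ].
    samples = [rollout(theta) for _ in range(n_paths)]
    return float(np.mean([r * s for r, s in samples]))

theta = 0.0
for _ in range(50):
    theta += 0.05 * policy_gradient(theta)   # plain gradient ascent on E[R]

A baseline subtraction would reduce the variance of this estimator; it is omitted here to keep the correspondence with the reward-weighted form stated above explicit.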
Related Papers
Absolute Continuity of Symmetric Markov Processes
We study Girsanov’s theorem in the context of symmetric Markov processes, extending earlier work of Fukushima-Takeda and Fitzsimmons on Girsanov transformations of “gradient type”. We investigate the most general Girsanov transformation leading to another symmetric Markov process. This investigation requires an extension of the forward-backward martingale method of Lyons-Zheng, to cover the cas...
Scaling Reinforcement Learning Paradigms for Motor Control
Reinforcement learning offers a general framework to explain reward-related learning in artificial and biological motor control. However, current reinforcement learning methods rarely scale to high-dimensional movement systems and mainly operate in discrete, low-dimensional domains like game-playing, artificial toy problems, etc. This drawback makes them unsuitable for application to human or ...
Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control
Policy gradient methods for direct policy optimization are widely used to obtain optimal policies in continuous Markov decision process (MDP) environments. However, policy gradient methods require exponentially many samples as the dimension of the action space increases. Thus, off-policy learning with experience replay is proposed to enable the agent to learn by using samples of other pol...
Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation
Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces; they update the policy parameters along the direction of steepest ascent of the expected return. However, the large variance of policy gradient estimates often causes instability in the policy update. In this paper, we propose to suppress the variance of gradient estimation by directly employing the var...
Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal t...